{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "84a147df",
   "metadata": {},
   "source": [
    "# AutoMM Detection - Quick Start with Foundation Model on Open Vocabulary Detection (OVD)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/object_detection/quick_start/quick_start_ovd.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/object_detection/quick_start/quick_start_ovd.ipynb)\n",
    "\n",
    "\n",
    "In this section, our goal is to use a foundation model in object detection to detect novel classes defined by an unbounded (open) vocabulary.\n",
    "\n",
    "## Setting up the imports\n",
    "To start, let's import MultiModalPredictor, and also make sure groundingdino is installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6122a63c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "348ce934",
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import groundingdino\n",
    "except ImportError:\n",
    "    try:\n",
    "        from pip._internal import main as pipmain\n",
    "    except ImportError:\n",
    "        from pip import main as pipmain  # for old pip version\n",
    "    pipmain(['install', '--user', 'git+https://github.com/FANGAreNotGnu/GroundingDINO.git'])  # equals to \"!pip install git+https://github.com/IDEA-Research/GroundingDINO.git\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5481f3b2",
   "metadata": {},
   "source": [
    "## Prepare sample image\n",
    "Let's use an image of Seattle's street view to demo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff4dd355",
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Image, display\n",
    "from autogluon.multimodal import download\n",
    "\n",
    "sample_image_url = \"https://live.staticflickr.com/65535/49004630088_d15a9be500_6k.jpg\"\n",
    "sample_image_path = download(sample_image_url)\n",
    "\n",
    "display(Image(filename=sample_image_path))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dfc32ec3",
   "metadata": {},
   "source": [
    "## Creating the MultiModalPredictor\n",
    "We create the MultiModalPredictor and specify the problem_type to `\"open_vocabulary_object_detection\"`.\n",
    "We set the preset as `\"best_quality\"`, which uses a SwinB as backbone. This preset gives us higher accuracy for detection. \n",
    "We also provide presets `\"high_quality\"` and `\"medium_quality\"` with SwinT as backbone, faster but also with lower performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "228bd148",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Init predictor\n",
    "predictor = MultiModalPredictor(problem_type=\"open_vocabulary_object_detection\", presets = \"best_quality\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1de5ce66",
   "metadata": {},
   "source": [
    "## Inference\n",
    "To run inference on the image, perform:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ce6e910",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred = predictor.predict(\n",
    "    {\n",
    "        \"image\": [sample_image_path],\n",
    "        \"prompt\": [\"Pink notice. Green sign. One Way sign. People group. Tower crane in construction. Lamp post. Glass skyscraper.\"],\n",
    "    },\n",
    "    as_pandas=True,\n",
    ")\n",
    "\n",
    "print(pred)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "aa6df9d9",
   "metadata": {},
   "source": [
    "The output `pred` is a `pandas` `DataFrame` that has two columns, `image` and `bboxes`.\n",
    "\n",
    "In `image`, each row contains the image path\n",
    "\n",
    "In `bboxes`, each row is a list of dictionaries, each one representing the prediction for an object in the image: `{\"class\": <predicted_class_name>, \"bbox\": [x1, y1, x2, y2], \"score\": <confidence_score>}`, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b46b979",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(pred[\"bboxes\"][0][0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3c26f03",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install opencv-python"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8221cc2b",
   "metadata": {},
   "source": [
    "To visualize results, run the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c2c5c04",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.multimodal.utils import ObjectDetectionVisualizer\n",
    "\n",
    "conf_threshold = 0.2  # Specify a confidence threshold to filter out unwanted boxes\n",
    "image_result = pred.iloc[0]\n",
    "\n",
    "img_path = image_result.image  # Select an image to visualize\n",
    "\n",
    "visualizer = ObjectDetectionVisualizer(img_path)  # Initialize the Visualizer\n",
    "out = visualizer.draw_instance_predictions(image_result, conf_threshold=conf_threshold)  # Draw detections\n",
    "visualized = out.get_image()  # Get the visualized image\n",
    "\n",
    "from PIL import Image\n",
    "from IPython.display import display\n",
    "img = Image.fromarray(visualized, 'RGB')\n",
    "display(img)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa58b518",
   "metadata": {},
   "source": [
    "## Other Examples\n",
    "\n",
    "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples about AutoMM.\n",
    "\n",
    "## Customization\n",
    "To learn how to customize AutoMM, please refer to [Customize AutoMM](../../advanced_topics/customization.ipynb).\n",
    "\n",
    "## Citation"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0c8dfcc8",
   "metadata": {},
   "source": [
    "```\n",
    "@misc{liu2023grounding,\n",
    "      title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection}, \n",
    "      author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},\n",
    "      year={2023},\n",
    "      eprint={2303.05499},\n",
    "      archivePrefix={arXiv},\n",
    "      primaryClass={cs.CV}\n",
    "}\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "det116",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  },
  "vscode": {
   "interpreter": {
    "hash": "0c3764fa02e1f76b12dc0a2359f5229c043b8e8c53abd2abaacd8712f272c448"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
